Data 621 Homework 4

Introduction

In this assignment, we will explore, analyze, and model a data set containing approximately 8,000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is binary: a value of 1 indicates that the customer was in a car crash, while a 0 indicates that they were not. The second response variable is TARGET_AMT. This value is 0 if the customer did not crash their car; if they did, this number will be a value greater than 0.

The objective is to build multiple linear regression and binary logistic regression models on the training data to predict whether a customer will crash their car and to predict the cost in the event of a crash. We will only use the variables given to us (or variables derived from them).

Below is a short description of the variables of interest in the data set:

Data Exploration

The dataset consists of 26 variables and 8,161 observations, with the AGE, YOJ, and CAR_AGE variables containing some missing values. As stated previously, TARGET_FLAG and TARGET_AMT are our response variables. Also, 14 of the variables are encoded as factors (discrete values) and the rest are numeric.

Data summary
Name data_train
Number of rows 8161
Number of columns 26
_______________________
Column type frequency:
factor 14
numeric 12
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
INCOME 0 1 FALSE 6613 $0: 615, emp: 445, $26: 4, $48: 4
PARENT1 0 1 FALSE 2 No: 7084, Yes: 1077
HOME_VAL 0 1 FALSE 5107 $0: 2294, emp: 464, $11: 3, $11: 3
MSTATUS 0 1 FALSE 2 Yes: 4894, z_N: 3267
SEX 0 1 FALSE 2 z_F: 4375, M: 3786
EDUCATION 0 1 FALSE 5 z_H: 2330, Bac: 2242, Mas: 1658, <Hi: 1203
JOB 0 1 FALSE 9 z_B: 1825, Cle: 1271, Pro: 1117, Man: 988
CAR_USE 0 1 FALSE 2 Pri: 5132, Com: 3029
BLUEBOOK 0 1 FALSE 2789 $1,: 157, $6,: 34, $5,: 33, $6,: 33
CAR_TYPE 0 1 FALSE 6 z_S: 2294, Min: 2145, Pic: 1389, Spo: 907
RED_CAR 0 1 FALSE 2 no: 5783, yes: 2378
OLDCLAIM 0 1 FALSE 2857 $0: 5009, $1,: 4, $1,: 4, $4,: 4
REVOKED 0 1 FALSE 2 No: 7161, Yes: 1000
URBANICITY 0 1 FALSE 2 Hig: 6492, z_H: 1669

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
INDEX 0 1.00 5151.87 2978.89 1 2559 5133 7745 10302.0 ▇▇▇▇▇
TARGET_FLAG 0 1.00 0.26 0.44 0 0 0 1 1.0 ▇▁▁▁▃
TARGET_AMT 0 1.00 1504.32 4704.03 0 0 0 1036 107586.1 ▇▁▁▁▁
KIDSDRIV 0 1.00 0.17 0.51 0 0 0 0 4.0 ▇▁▁▁▁
AGE 6 1.00 44.79 8.63 16 39 45 51 81.0 ▁▆▇▂▁
HOMEKIDS 0 1.00 0.72 1.12 0 0 0 1 5.0 ▇▂▁▁▁
YOJ 454 0.94 10.50 4.09 0 9 11 13 23.0 ▂▃▇▃▁
TRAVTIME 0 1.00 33.49 15.91 5 22 33 44 142.0 ▇▇▁▁▁
TIF 0 1.00 5.35 4.15 1 1 4 7 25.0 ▇▆▁▁▁
CLM_FREQ 0 1.00 0.80 1.16 0 0 0 2 5.0 ▇▂▁▁▁
MVR_PTS 0 1.00 1.70 2.15 0 0 1 3 13.0 ▇▂▁▁▁
CAR_AGE 510 0.94 8.33 5.70 -3 1 8 12 28.0 ▆▇▇▃▁

Data Processing


Fix data types

We noticed that a few variables listed as discrete have large numbers of unique values. A closer inspection of the variable descriptions reveals that while these variables are encoded as factors, they are actually continuous. The TARGET_FLAG variable also appears in the summary as a numeric variable, but it should be a binary factor. We proceed to fix these data types.
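The conversion can be sketched as follows, on a toy frame standing in for the training data (the example values are hypothetical; the real currency-formatted factors include INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM):

```r
# Toy frame standing in for data_train (values are hypothetical)
data_train <- data.frame(
  TARGET_FLAG = c(0, 1),
  INCOME   = c("$67,349", "$0"),
  BLUEBOOK = c("$14,230", "$6,000"),
  stringsAsFactors = TRUE
)

# Strip "$" and "," from the factor-encoded dollar amounts, then parse as numeric
to_numeric <- function(x) as.numeric(gsub("[$,]", "", as.character(x)))
money_vars <- c("INCOME", "BLUEBOOK")
data_train[money_vars] <- lapply(data_train[money_vars], to_numeric)

# TARGET_FLAG is a binary outcome, not a numeric measurement
data_train$TARGET_FLAG <- factor(data_train$TARGET_FLAG, levels = c(0, 1))
```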

Fix bad and missing values

Also, there are some values that seem invalid (e.g., a CAR_AGE of -3). Since the missing values in these variables make up less than 5% of the observations, we can replace them with the median. We will take the median on the training set only and impute that value in both the training and testing sets to avoid data leakage.
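A minimal sketch of this train-only median imputation, shown on toy data for two of the affected variables (the real data also imputes YOJ the same way):

```r
# Toy frames standing in for the training and testing data (hypothetical values)
train <- data.frame(AGE = c(40, NA, 50), CAR_AGE = c(-3, 8, 12))
test  <- data.frame(AGE = c(NA, 35),    CAR_AGE = c(NA, 10))

train$CAR_AGE[train$CAR_AGE < 0] <- NA    # a negative car age is invalid
for (v in c("AGE", "CAR_AGE")) {
  med <- median(train[[v]], na.rm = TRUE) # statistic from the training set only
  train[[v]][is.na(train[[v]])] <- med
  test[[v]][is.na(test[[v]])]   <- med    # same value applied to the test set
}
```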

Univariate charts

We now explore the distribution of TARGET_FLAG across the numeric variables. We see that BLUEBOOK, INCOME, and OLDCLAIM have a high number of outliers compared to the other variables. We also see that customers who are older, or who have older cars, higher home values, or higher incomes, tend to get into fewer car crashes. However, people with motor vehicle record points or a high number of old claims tend to get into more accidents.

The variables displayed below, such as OLDCLAIM, INCOME, BLUEBOOK, and HOME_VAL, need scale transformations. Several variables have a high number of zeros. AGE is the only variable that is approximately normally distributed; the rest show some skewness. We will perform a Box-Cox transformation on the skewed variables.
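For reference, the Box-Cox family can be written directly for a fixed lambda; note that it requires strictly positive input, so the zero-heavy variables need an offset first. A minimal sketch on hypothetical values:

```r
# Box-Cox transform for a fixed lambda; lambda = 0 reduces to the log transform
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Box-Cox needs x > 0, so shift the zero-heavy toy values by 1 first
skewed <- c(0, 1, 4, 20, 100) + 1
transformed <- box_cox(skewed, 0.5)  # lambda = 0.5 behaves like a square root
```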

Correlation

We see that MVR_PTS, CLM_FREQ, and OLDCLAIM are the most positively correlated variables with our response variables, whereas URBANICITY is the most negatively correlated variable. The rest of the variables are weakly correlated.

Centrality Measures and Outliers

As was previously noted, this distribution has a long tail. The mean payout is $5616 and the median is $4102. The mean and median are, of course, higher for those observations we classified as outliers. The outlier cutoff point is $10594.
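The cutoff is presumably the standard boxplot rule (Q3 + 1.5 × IQR); a sketch of that rule on hypothetical payout values:

```r
# Standard boxplot outlier rule: anything above Q3 + 1.5 * IQR
payouts <- c(1000, 2500, 4102, 6000, 9000, 30000)  # toy right-skewed payouts
cutoff  <- quantile(payouts, 0.75) + 1.5 * IQR(payouts)
is_outlier <- payouts > cutoff
```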


Data Preparation

Sampling


   0    1 
6008 2153 

There is an imbalance in the TARGET_FLAG variable.

Let’s check the class distribution


        0         1 
0.7361843 0.2638157 

Only 26% of the records represent customers who have had an accident, versus 74% with a negative flag. This is a severely imbalanced data set, which would distort the accuracy score in the model building step if left untreated.

To treat this imbalance, we will use oversampling of the minority class.
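A minimal sketch of random oversampling, on a toy imbalanced frame standing in for the training data (the seed is arbitrary):

```r
set.seed(621)  # arbitrary seed for reproducibility
# Toy imbalanced frame standing in for the training data
df <- data.frame(TARGET_FLAG = factor(c(rep(0, 8), rep(1, 2))))

# Resample the minority class with replacement until the class counts match
minority <- df[df$TARGET_FLAG == 1, , drop = FALSE]
n_extra  <- sum(df$TARGET_FLAG == 0) - nrow(minority)
balanced <- rbind(df,
                  minority[sample(nrow(minority), n_extra, replace = TRUE),
                           , drop = FALSE])
table(balanced$TARGET_FLAG)
```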

We check the balance again:


   0    1 
6008 6008 

Influential Leverage Points


Model Building - Logit Models

# A function to extract the relevant metrics from the model summary and confusion matrix
score_model <- function(id, model, data, output = FALSE) {
  if (output) print(summary(model))
  glm.probs <- predict(model, type = "response")
  # Classify using the 0.5 probability threshold
  glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
  results <- tibble(target = data$TARGET_FLAG, pred = glm.pred) %>%
    mutate(pred.class = as.factor(pred), target.class = as.factor(target))
  
  # Compute the confusion matrix once and reuse it for all metrics
  cm <- confusionMatrix(results$pred.class, results$target.class, positive = "1")
  if (output) print(cm)
  
  metrics <- list(res.deviance = model$deviance,
                  null.deviance = model$null.deviance,
                  aic = model$aic,
                  accuracy = cm$overall['Accuracy'],
                  sensitivity = cm$byClass['Sensitivity'],
                  specificity = cm$byClass['Specificity'])
  metrics <- lapply(metrics, round, 3)
  
  if (output) plot(roc(results$target.class, glm.probs), print.auc = TRUE)
  model.df <- tibble(id = id, res.deviance = metrics$res.deviance,
                     null.deviance = metrics$null.deviance, aic = metrics$aic,
                     accuracy = metrics$accuracy, sensitivity = metrics$sensitivity,
                     specificity = metrics$specificity)
  # Return the fitted model itself (the original referenced an undefined `glm.fit`)
  return(list(model = model, df_info = model.df))
}

Model 1:

Model 2: Basic Logit Models

We construct null, full and reduced models. The reduced model is created via stepwise regression.


Call:
glm(formula = TARGET_FLAG ~ ., family = binomial(link = "logit"), 
    data = mod2_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5846  -0.7127  -0.3983   0.6261   3.1524  

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -9.286e-01  3.215e-01  -2.889 0.003869 ** 
KIDSDRIV                         3.862e-01  6.122e-02   6.308 2.82e-10 ***
AGE                             -1.015e-03  4.020e-03  -0.252 0.800672    
HOMEKIDS                         4.965e-02  3.713e-02   1.337 0.181119    
YOJ                             -1.105e-02  8.582e-03  -1.288 0.197743    
INCOME                          -3.423e-06  1.081e-06  -3.165 0.001551 ** 
PARENT1Yes                       3.820e-01  1.096e-01   3.485 0.000492 ***
HOME_VAL                        -1.306e-06  3.420e-07  -3.819 0.000134 ***
MSTATUSz_No                      4.938e-01  8.357e-02   5.909 3.45e-09 ***
SEXz_F                          -8.251e-02  1.120e-01  -0.737 0.461416    
EDUCATIONBachelors              -3.812e-01  1.157e-01  -3.296 0.000981 ***
EDUCATIONMasters                -2.903e-01  1.788e-01  -1.624 0.104397    
EDUCATIONPhD                    -1.677e-01  2.140e-01  -0.784 0.433295    
EDUCATIONz_High School           1.764e-02  9.506e-02   0.186 0.852802    
JOBClerical                      4.107e-01  1.967e-01   2.088 0.036763 *  
JOBDoctor                       -4.458e-01  2.671e-01  -1.669 0.095106 .  
JOBHome Maker                    2.323e-01  2.102e-01   1.106 0.268915    
JOBLawyer                        1.049e-01  1.695e-01   0.619 0.535958    
JOBManager                      -5.572e-01  1.716e-01  -3.248 0.001161 ** 
JOBProfessional                  1.619e-01  1.784e-01   0.907 0.364168    
JOBStudent                       2.161e-01  2.145e-01   1.007 0.313729    
JOBz_Blue Collar                 3.106e-01  1.856e-01   1.674 0.094158 .  
TRAVTIME                         1.457e-02  1.883e-03   7.736 1.03e-14 ***
CAR_USEPrivate                  -7.564e-01  9.172e-02  -8.247  < 2e-16 ***
BLUEBOOK                        -2.084e-05  5.263e-06  -3.959 7.52e-05 ***
TIF                             -5.547e-02  7.344e-03  -7.553 4.26e-14 ***
CAR_TYPEPanel Truck              5.607e-01  1.618e-01   3.466 0.000528 ***
CAR_TYPEPickup                   5.540e-01  1.007e-01   5.500 3.80e-08 ***
CAR_TYPESports Car               1.025e+00  1.299e-01   7.893 2.95e-15 ***
CAR_TYPEVan                      6.186e-01  1.265e-01   4.891 1.00e-06 ***
CAR_TYPEz_SUV                    7.682e-01  1.113e-01   6.904 5.05e-12 ***
RED_CARyes                      -9.728e-03  8.636e-02  -0.113 0.910313    
OLDCLAIM                        -1.389e-05  3.910e-06  -3.554 0.000380 ***
CLM_FREQ                         1.959e-01  2.855e-02   6.864 6.69e-12 ***
REVOKEDYes                       8.874e-01  9.133e-02   9.716  < 2e-16 ***
MVR_PTS                          1.133e-01  1.361e-02   8.324  < 2e-16 ***
CAR_AGE                         -7.196e-04  7.549e-03  -0.095 0.924053    
URBANICITYz_Highly Rural/ Rural -2.390e+00  1.128e-01 -21.181  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7297.6  on 8123  degrees of freedom
AIC: 7373.6

Number of Fisher Scoring iterations: 5

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 5551 1235
         1  457  918
                                          
               Accuracy : 0.7927          
                 95% CI : (0.7837, 0.8014)
    No Information Rate : 0.7362          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3963          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4264          
            Specificity : 0.9239          
         Pos Pred Value : 0.6676          
         Neg Pred Value : 0.8180          
             Prevalence : 0.2638          
         Detection Rate : 0.1125          
   Detection Prevalence : 0.1685          
      Balanced Accuracy : 0.6752          
                                          
       'Positive' Class : 1               
                                          

The summary output of the reduced model retains a number of statistically significant predictors.


Call:
glm(formula = TARGET_FLAG ~ URBANICITY + JOB + MVR_PTS + MSTATUS + 
    CAR_TYPE + REVOKED + KIDSDRIV + CAR_USE + TIF + TRAVTIME + 
    INCOME + CLM_FREQ + BLUEBOOK + PARENT1 + EDUCATION + HOME_VAL + 
    OLDCLAIM, family = binomial(link = "logit"), data = mod2_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6039  -0.7115  -0.3979   0.6268   3.1440  

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -1.053e+00  2.559e-01  -4.117 3.84e-05 ***
URBANICITYz_Highly Rural/ Rural -2.389e+00  1.128e-01 -21.181  < 2e-16 ***
JOBClerical                      4.141e-01  1.965e-01   2.107 0.035104 *  
JOBDoctor                       -4.475e-01  2.667e-01  -1.678 0.093344 .  
JOBHome Maker                    2.748e-01  2.042e-01   1.346 0.178234    
JOBLawyer                        9.715e-02  1.692e-01   0.574 0.565740    
JOBManager                      -5.649e-01  1.714e-01  -3.296 0.000980 ***
JOBProfessional                  1.548e-01  1.784e-01   0.868 0.385304    
JOBStudent                       2.751e-01  2.109e-01   1.304 0.192066    
JOBz_Blue Collar                 3.098e-01  1.855e-01   1.670 0.094879 .  
MVR_PTS                          1.143e-01  1.359e-02   8.412  < 2e-16 ***
MSTATUSz_No                      4.719e-01  7.955e-02   5.932 2.99e-09 ***
CAR_TYPEPanel Truck              6.090e-01  1.509e-01   4.035 5.46e-05 ***
CAR_TYPEPickup                   5.503e-01  1.006e-01   5.469 4.53e-08 ***
CAR_TYPESports Car               9.726e-01  1.074e-01   9.054  < 2e-16 ***
CAR_TYPEVan                      6.466e-01  1.221e-01   5.295 1.19e-07 ***
CAR_TYPEz_SUV                    7.156e-01  8.596e-02   8.324  < 2e-16 ***
REVOKEDYes                       8.927e-01  9.123e-02   9.785  < 2e-16 ***
KIDSDRIV                         4.176e-01  5.512e-02   7.576 3.57e-14 ***
CAR_USEPrivate                  -7.574e-01  9.161e-02  -8.268  < 2e-16 ***
TIF                             -5.538e-02  7.340e-03  -7.545 4.53e-14 ***
TRAVTIME                         1.448e-02  1.881e-03   7.699 1.37e-14 ***
INCOME                          -3.486e-06  1.076e-06  -3.239 0.001199 ** 
CLM_FREQ                         1.963e-01  2.852e-02   6.882 5.91e-12 ***
BLUEBOOK                        -2.308e-05  4.719e-06  -4.891 1.00e-06 ***
PARENT1Yes                       4.602e-01  9.427e-02   4.882 1.05e-06 ***
EDUCATIONBachelors              -3.868e-01  1.089e-01  -3.554 0.000380 ***
EDUCATIONMasters                -3.032e-01  1.615e-01  -1.878 0.060385 .  
EDUCATIONPhD                    -1.818e-01  2.002e-01  -0.908 0.363825    
EDUCATIONz_High School           1.487e-02  9.469e-02   0.157 0.875229    
HOME_VAL                        -1.342e-06  3.407e-07  -3.939 8.18e-05 ***
OLDCLAIM                        -1.405e-05  3.907e-06  -3.595 0.000324 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7301.8  on 8129  degrees of freedom
AIC: 7365.8

Number of Fisher Scoring iterations: 5

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 5558 1245
         1  450  908
                                          
               Accuracy : 0.7923          
                 95% CI : (0.7833, 0.8011)
    No Information Rate : 0.7362          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3934          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4217          
            Specificity : 0.9251          
         Pos Pred Value : 0.6686          
         Neg Pred Value : 0.8170          
             Prevalence : 0.2638          
         Detection Rate : 0.1113          
   Detection Prevalence : 0.1664          
      Balanced Accuracy : 0.6734          
                                          
       'Positive' Class : 1               
                                          

We compute McFadden’s pseudo R squared for logistic regression and we see that the difference between the full model and the reduced model is only marginal. We proceed with the smaller model.

[1] "Full model = 0.2251"
[1] "Reduced model = 0.2247"
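The reported values follow directly from the deviances in the model summaries above:

```r
# McFadden's pseudo R^2: 1 - residual deviance / null deviance,
# using the deviances reported in the model summaries
mcfadden <- function(res_dev, null_dev) 1 - res_dev / null_dev
round(mcfadden(7297.6, 9418.0), 4)  # full model
round(mcfadden(7301.8, 9418.0), 4)  # reduced model
```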

Influential Values

We inspect the influential values.

Outliers

Only one outlier is present; we may consider removing it.

Multicollinearity

The raw GVIF values for JOB and EDUCATION exceed 5, but after adjusting for degrees of freedom (GVIF^(1/(2*Df))) both fall well below 2, so multicollinearity does not appear to be a serious concern.

                GVIF Df GVIF^(1/(2*Df))
URBANICITY  1.140628  1        1.068002
JOB        18.426984  8        1.199750
MVR_PTS     1.158224  1        1.076208
MSTATUS     1.863154  1        1.364974
CAR_TYPE    2.636934  5        1.101818
REVOKED     1.313343  1        1.146012
KIDSDRIV    1.090527  1        1.044283
CAR_USE     2.445424  1        1.563785
TIF         1.009108  1        1.004544
TRAVTIME    1.038286  1        1.018963
INCOME      2.468963  1        1.571293
CLM_FREQ    1.464061  1        1.209984
BLUEBOOK    1.767416  1        1.329442
PARENT1     1.428915  1        1.195372
EDUCATION   7.952890  4        1.295882
HOME_VAL    1.850544  1        1.360347
OLDCLAIM    1.645591  1        1.282806

Model 3: Penalized Logistic Model

Since the basic model contained many predictor variables, we take a look at a penalized logistic regression model, which imposes a penalty on the model for having too many predictors. We will identify the best shrinkage factor lambda through cross-validation with an 80%/20% train/test data partitioning.

We fit a lasso regression model (alpha = 1) and plot the cross-validation error over the log of lambda. The number of predictors is shown on top, and the vertical lines represent the optimal (minimal) value of lambda as well as the value of lambda which minimizes the number of predictors while remaining within one standard error of the optimal. We will consider models with both of these lambda values.

[1] "lambda.min = 0.000920870982666798"
[1] "lambda.1se = 0.00782513279990271"

The columns below compare the variables that are dropped in the lasso regression for the optimal and smallest model lambdas.

38 x 2 sparse Matrix of class "dgCMatrix"
                                            1             1
(Intercept)                     -6.256919e-01 -3.642785e-01
KIDSDRIV                         3.844400e-01  3.021601e-01
AGE                             -2.168260e-03 -1.310061e-03
HOMEKIDS                         3.773198e-02  2.751783e-02
YOJ                             -1.310347e-02 -5.695975e-03
INCOME                          -3.336289e-06 -3.688055e-06
PARENT1Yes                       4.312774e-01  4.127180e-01
HOME_VAL                        -1.118899e-06 -1.138745e-06
MSTATUSz_No                      4.639098e-01  3.217145e-01
SEXz_F                           .             .           
EDUCATIONBachelors              -3.487337e-01 -1.074464e-01
EDUCATIONMasters                -2.951971e-01 -5.462985e-02
EDUCATIONPhD                    -1.979495e-01  .           
EDUCATIONz_High School           2.044973e-02  1.039607e-01
JOBClerical                      3.108965e-01  1.532496e-01
JOBDoctor                       -4.312730e-01 -1.550210e-01
JOBHome Maker                    1.408637e-01  .           
JOBLawyer                       -2.746984e-02 -3.500884e-03
JOBManager                      -6.915838e-01 -5.933973e-01
JOBProfessional                  .             .           
JOBStudent                       2.566632e-02  .           
JOBz_Blue Collar                 1.350942e-01  6.494182e-03
TRAVTIME                         1.327734e-02  9.177456e-03
CAR_USEPrivate                  -7.681347e-01 -7.510074e-01
BLUEBOOK                        -2.258725e-05 -1.628431e-05
TIF                             -5.454742e-02 -4.043686e-02
CAR_TYPEPanel Truck              4.915321e-01  .           
CAR_TYPEPickup                   4.951164e-01  1.025761e-01
CAR_TYPESports Car               8.550137e-01  4.169693e-01
CAR_TYPEVan                      5.928614e-01  7.276048e-02
CAR_TYPEz_SUV                    6.450938e-01  2.783844e-01
RED_CARyes                       2.077374e-03  .           
OLDCLAIM                        -1.401987e-05 -1.818601e-08
CLM_FREQ                         1.965300e-01  1.353638e-01
REVOKEDYes                       8.381861e-01  5.709997e-01
MVR_PTS                          1.048640e-01  9.313352e-02
CAR_AGE                         -9.001762e-04 -6.672491e-03
URBANICITYz_Highly Rural/ Rural -2.271618e+00 -1.874310e+00

Predicting TARGET_FLAG

When comparing the accuracy of the two lasso-penalized models, we see that the difference in accuracy is marginal.

[1] 0.776824
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1163  326
         1   38  104
                                          
               Accuracy : 0.7768          
                 95% CI : (0.7558, 0.7968)
    No Information Rate : 0.7364          
    P-Value [Acc > NIR] : 9.139e-05       
                                          
                  Kappa : 0.2678          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.24186         
            Specificity : 0.96836         
         Pos Pred Value : 0.73239         
         Neg Pred Value : 0.78106         
             Prevalence : 0.26364         
         Detection Rate : 0.06376         
   Detection Prevalence : 0.08706         
      Balanced Accuracy : 0.60511         
                                          
       'Positive' Class : 1               
                                          
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1183  361
         1   18   69
                                          
               Accuracy : 0.7676          
                 95% CI : (0.7464, 0.7879)
    No Information Rate : 0.7364          
    P-Value [Acc > NIR] : 0.002043        
                                          
                  Kappa : 0.1955          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.16047         
            Specificity : 0.98501         
         Pos Pred Value : 0.79310         
         Neg Pred Value : 0.76619         
             Prevalence : 0.26364         
         Detection Rate : 0.04231         
   Detection Prevalence : 0.05334         
      Balanced Accuracy : 0.57274         
                                          
       'Positive' Class : 1               
                                          

Model Building - Multiple Regression Models

Predicting TARGET_AMT


Call:
lm(formula = TARGET_AMT ~ ., data = data_train2)

Residuals:
   Min     1Q Median     3Q    Max 
 -8943  -3176  -1501    480  99578 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      3.396e+03  1.845e+03   1.841   0.0658 .  
KIDSDRIV                        -1.719e+02  3.166e+02  -0.543   0.5873    
AGE                              1.835e+01  2.124e+01   0.864   0.3879    
HOMEKIDS                         2.135e+02  2.071e+02   1.031   0.3028    
YOJ                              1.916e+01  4.918e+01   0.390   0.6969    
INCOME                          -9.014e-03  6.742e-03  -1.337   0.1814    
PARENT1Yes                       2.772e+02  5.873e+02   0.472   0.6369    
HOME_VAL                         2.198e-03  2.020e-03   1.088   0.2767    
MSTATUSz_No                      8.036e+02  4.935e+02   1.628   0.1036    
SEXz_F                          -1.397e+03  6.564e+02  -2.129   0.0334 *  
EDUCATIONBachelors               2.606e+02  6.419e+02   0.406   0.6848    
EDUCATIONMasters                 1.194e+03  1.084e+03   1.102   0.2707    
EDUCATIONPhD                     2.396e+03  1.312e+03   1.827   0.0679 .  
EDUCATIONz_High School          -3.954e+02  5.145e+02  -0.769   0.4423    
JOBClerical                      3.084e+02  1.203e+03   0.256   0.7977    
JOBDoctor                       -2.116e+03  1.762e+03  -1.201   0.2299    
JOBHome Maker                   -2.353e+01  1.266e+03  -0.019   0.9852    
JOBLawyer                        3.272e+02  1.029e+03   0.318   0.7506    
JOBManager                      -7.794e+02  1.066e+03  -0.731   0.4647    
JOBProfessional                  1.061e+03  1.129e+03   0.940   0.3475    
JOBStudent                       1.147e+02  1.286e+03   0.089   0.9289    
JOBz_Blue Collar                 5.206e+02  1.146e+03   0.454   0.6497    
TRAVTIME                         7.528e-01  1.108e+01   0.068   0.9458    
CAR_USEPrivate                  -4.384e+02  5.216e+02  -0.840   0.4008    
BLUEBOOK                         1.245e-01  3.053e-02   4.077 4.73e-05 ***
TIF                             -1.574e+01  4.252e+01  -0.370   0.7112    
CAR_TYPEPanel Truck             -6.403e+02  9.605e+02  -0.667   0.5051    
CAR_TYPEPickup                  -5.637e+01  5.968e+02  -0.094   0.9248    
CAR_TYPESports Car               1.061e+03  7.502e+02   1.414   0.1575    
CAR_TYPEVan                      6.335e+01  7.707e+02   0.082   0.9345    
CAR_TYPEz_SUV                    9.025e+02  6.668e+02   1.354   0.1760    
RED_CARyes                      -1.933e+02  4.965e+02  -0.389   0.6970    
OLDCLAIM                         2.502e-02  2.263e-02   1.105   0.2691    
CLM_FREQ                        -1.154e+02  1.580e+02  -0.731   0.4651    
REVOKEDYes                      -1.125e+03  5.166e+02  -2.177   0.0296 *  
MVR_PTS                          1.108e+02  6.853e+01   1.617   0.1061    
CAR_AGE                         -9.842e+01  4.406e+01  -2.233   0.0256 *  
URBANICITYz_Highly Rural/ Rural -9.778e+01  7.562e+02  -0.129   0.8971    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7690 on 2115 degrees of freedom
Multiple R-squared:  0.03061,   Adjusted R-squared:  0.01365 
F-statistic: 1.805 on 37 and 2115 DF,  p-value: 0.002183

We fit a lasso regression model (alpha = 1).

The plot of cross-validation error over the log of lambda displays the minimal value.

[1] "lambda.min = 3628.41974018787"
[1] "lambda.1se = 8382.12019963966"

The columns below compare the variables that are dropped in the lasso regression for the optimal and smallest model lambdas.

38 x 2 sparse Matrix of class "dgCMatrix"
                                           1        1
(Intercept)                     5.126988e+03 5699.927
KIDSDRIV                        .               .    
AGE                             .               .    
HOMEKIDS                        .               .    
YOJ                             .               .    
INCOME                          .               .    
PARENT1Yes                      .               .    
HOME_VAL                        .               .    
MSTATUSz_No                     .               .    
SEXz_F                          .               .    
EDUCATIONBachelors              .               .    
EDUCATIONMasters                .               .    
EDUCATIONPhD                    .               .    
EDUCATIONz_High School          .               .    
JOBClerical                     .               .    
JOBDoctor                       .               .    
JOBHome Maker                   .               .    
JOBLawyer                       .               .    
JOBManager                      .               .    
JOBProfessional                 .               .    
JOBStudent                      .               .    
JOBz_Blue Collar                .               .    
TRAVTIME                        .               .    
CAR_USEPrivate                  .               .    
BLUEBOOK                        3.973882e-02    .    
TIF                             .               .    
CAR_TYPEPanel Truck             .               .    
CAR_TYPEPickup                  .               .    
CAR_TYPESports Car              .               .    
CAR_TYPEVan                     .               .    
CAR_TYPEz_SUV                   .               .    
RED_CARyes                      .               .    
OLDCLAIM                        .               .    
CLM_FREQ                        .               .    
REVOKEDYes                      .               .    
MVR_PTS                         .               .    
CAR_AGE                         .               .    
URBANICITYz_Highly Rural/ Rural .               .    
[1] 0
[1] 0

Model 4:

Model Selection